Add native gfx12 attention backend by jammm · Pull Request #368 · thu-ml/SageAttention

jammm · 2026-05-13T19:18:21Z

Summary

Adds a native ROCm gfx12 backend for SageAttention on RDNA4, including:

gfx12 native QK int8 attention paths for fp8 and fp16 value modes
fp8 support for D16/D64/D128 and fp16 support for D16/D64
automatic gfx12 runtime dispatch from the public sageattn API
internal sequence padding plus logical KV tail masking so non-64 sequence lengths do not fall back
HIP build integration for Windows/Linux ROCm PyTorch wheels
HIP compatibility fixes for fused/smooth_k support
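
The padding-plus-masking item above can be illustrated with a toy, pure-Python sketch. This is not the PR's kernel code: the block size of 64 is inferred from the "non-64 sequence lengths" wording, and every function name here is made up for illustration. The idea shown is that K/V are padded up to the next multiple of the block size, and the padded tail is masked to negative infinity so it receives zero softmax weight.

```python
import math

BLOCK = 64  # assumed tile size, inferred from "non-64 sequence lengths"


def padded_len(n, block=BLOCK):
    """Round n up to the next multiple of block."""
    return ((n + block - 1) // block) * block


def attend(q, keys, values, logical_len=None):
    """Single-query softmax attention over keys[:logical_len].

    Positions at or beyond logical_len are treated as padding and get a
    score of -inf, so their softmax weight is exactly zero.
    """
    if logical_len is None:
        logical_len = len(keys)
    scale = 1.0 / math.sqrt(len(q))
    scores = [
        scale * sum(a * b for a, b in zip(q, k)) if i < logical_len else -math.inf
        for i, k in enumerate(keys)
    ]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def pad_rows(rows, target_len):
    """Append zero rows until there are target_len rows."""
    width = len(rows[0])
    return rows + [[0.0] * width for _ in range(target_len - len(rows))]


# Toy data: 5 keys/values of width 4 (5 is not a multiple of 64).
q = [0.1, -0.2, 0.3, 0.4]
keys = [[0.01 * (i + j) for j in range(4)] for i in range(5)]
values = [[float(i), float(-i), 1.0, 0.5 * i] for i in range(5)]

n = len(keys)
n_pad = padded_len(n)
reference = attend(q, keys, values)
masked = attend(q, pad_rows(keys, n_pad), pad_rows(values, n_pad), logical_len=n)
```

With the mask in place, `masked` matches `reference`, which is why padding alone does not have to change results or force a fallback.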

Build

Windows

setup.py discovers ROCm through rocm-sdk, sets ROCM_HOME, adds the ROCm LLVM/bin paths, and defaults the Windows compiler settings to clang-cl. Users still need to run from an initialized Visual Studio shell.

# Activate Visual Studio environment
cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

# Activate the virtual environment
.\venv\Scripts\Activate.ps1

pip install --no-build-isolation -v .

Optional cross-build target:

$env:PYTORCH_ROCM_ARCH = "gfx1201"
pip install --no-build-isolation -v .

Linux

pip install --no-build-isolation -v .

Optional cross-build target:

PYTORCH_ROCM_ARCH=gfx1201 pip install --no-build-isolation -v .

Correctness

Validated native gfx12 output against FlashAttention:

50/50 cases passed
fp8: D16/D64/D128
fp16: D16/D64
causal and non-causal
S64/S128/S1024
HND and NHD layouts
GQA Hq/Hkv = 8/2
bf16 fp8 path
smooth_k fp8/fp16 path
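
For readers unfamiliar with the smooth_k option: in SageAttention it refers to subtracting the per-channel mean of K over tokens before quantization. The sketch below is a pure-Python illustration with made-up data and helper names, not code from this PR. It shows why the trick is safe: subtracting the mean shifts every score of a query by the same constant, which softmax ignores, while the values that need to be quantized become much smaller.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def smooth_keys(keys):
    """Subtract the per-channel mean over tokens from every key."""
    n = len(keys)
    mean = [sum(k[d] for k in keys) / n for d in range(len(keys[0]))]
    return [[k[d] - mean[d] for d in range(len(k))] for k in keys], mean


# Made-up keys with a large shared offset per channel.
q = [0.5, -1.0, 0.25]
keys = [[10.0, 9.5, 10.2], [10.1, 9.7, 10.0], [9.9, 9.6, 10.4], [10.3, 9.4, 10.1]]

smoothed, mean = smooth_keys(keys)
p_raw = softmax([dot(q, k) for k in keys])
p_smooth = softmax([dot(q, k) for k in smoothed])

# Dynamic range that an int8 quantizer would have to cover.
max_abs_raw = max(abs(x) for k in keys for x in k)
max_abs_smooth = max(abs(x) for k in smoothed for x in k)
```

The attention probabilities agree up to floating-point rounding, while `max_abs_smooth` is far below `max_abs_raw`, leaving more of the int8 range for the actual variation between tokens.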

Additional runtime compatibility smoke:

ComfyUI Wan2.1 NHD cross-attention runs through the gfx12 native path without fallback:
- q: torch.Size([2, 14040, 12, 128])
- k/v: torch.Size([2, 512, 12, 128])
The fp8 NHD D128 non-causal smooth path avoids materializing a padded Q copy for tail shapes while preserving CUDA-compatible smooth-K behavior.
Quality checks against SageAttention v1 on fp8/fp16 NHD cases preserve output scale:
- Wan-style fp8 D128 tail: rel RMSE 0.036942, std ratio 0.999413, cosine 0.999318 versus v1
- fp16 D64: rel RMSE about 0.0086-0.0100, std ratio about 0.99997-1.00006 versus v1
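
The PR does not spell out how these three quality numbers are defined. A minimal sketch under common definitions (assumed here, not confirmed by the source) is shown below, using flat Python lists and made-up values in place of real attention outputs.

```python
import math
import statistics


def rel_rmse(test, ref):
    """RMS of the difference divided by RMS of the reference."""
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)) / len(ref))
    den = math.sqrt(sum(r * r for r in ref) / len(ref))
    return num / den


def std_ratio(test, ref):
    """Ratio of standard deviations; values near 1 mean the output scale is preserved."""
    return statistics.pstdev(test) / statistics.pstdev(ref)


def cosine(test, ref):
    """Cosine similarity between the flattened outputs."""
    dot = sum(t * r for t, r in zip(test, ref))
    norm_t = math.sqrt(sum(t * t for t in test))
    norm_r = math.sqrt(sum(r * r for r in ref))
    return dot / (norm_t * norm_r)


# Made-up values standing in for flattened attention outputs.
ref = [0.50, -1.25, 2.00, 0.75, -0.40]
test = [0.52, -1.20, 1.95, 0.77, -0.42]

metrics = {
    "rel_rmse": rel_rmse(test, ref),
    "std_ratio": std_ratio(test, ref),
    "cosine": cosine(test, ref),
}
```

A std ratio noticeably below 1 would indicate a shrunken output scale, which is the kind of symptom a washed-out image would be consistent with.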

Performance

Measured on gfx1201 / Radeon RX 9070 XT, B=1, H=32, S=1K/2K/4K/8K. FlashAttention comparison uses the installed FlashAttention package in the ROCm venv.

ComfyUI Wan2.1 fp8 Workload

Tested Wan2.1 1.4B with fp8 model weights on a Radeon RX 9070 XT:

--use-pytorch-cross-attention: 2.53 s/it diffusion steps, 120s total
--use-sage-attention: 1.78 s/it diffusion steps, 100s total

The diffusion steps run about 42% faster with the gfx12 SageAttention v2 kernels. The native gfx12 path uses int8 WMMA for QK and fp8 WMMA for PV, with compiled ISA containing v_wmma_i32_16x16x16_iu8 and v_wmma_f32_16x16x16_fp8_fp8.
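
As a quick check of the arithmetic, the snippet below recomputes the figures from the two runs above. The "42% faster" number is a throughput ratio (iterations per second); the reduction in time per step and in total wall time is smaller. The helper names are illustrative only.

```python
def speedup(baseline, new):
    """How many times faster the new run is (ratio of times)."""
    return baseline / new


def time_saved_fraction(baseline, new):
    """Fraction of the baseline time that the new run saves."""
    return (baseline - new) / baseline


sdpa_s_per_it = 2.53  # --use-pytorch-cross-attention
sage_s_per_it = 1.78  # --use-sage-attention

throughput_gain = speedup(sdpa_s_per_it, sage_s_per_it) - 1.0    # about 0.42
step_time_cut = time_saved_fraction(sdpa_s_per_it, sage_s_per_it)  # about 0.30
total_time_cut = time_saved_fraction(120.0, 100.0)                 # about 0.17
```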

In both runs, VAE decode is the current bottleneck because it exceeds the 16GB VRAM on the 9070 XT. Higher-VRAM RDNA4 cards should avoid this issue.

This was measured with MIOpen disabled, which is ComfyUI's current default behavior. With COMFYUI_ENABLE_MIOPEN=1, the first run took 265s overall and the second run completed in 84.5s.

Versus FlashAttention

Mode	Speedup vs FlashAttention
fp8 non-causal	1.20x-2.04x, geo 1.44x
fp8 causal	0.87x-1.73x, geo 1.27x
fp16 non-causal	0.94x-1.68x, geo 1.29x
fp16 causal	0.67x-1.45x, geo 1.03x

Most large-shape fp8/fp16 cases are faster than FlashAttention. Remaining short-shape gaps are mainly fp16 D64 causal at 1K-4K and fp8 causal D128 at 1K.

Additional focused tail-shape check after the smooth-K quality fix:

Shape	Mode	SageAttention gfx12	FlashAttention	Speedup
B=2, Sq=14040, Sk=512, H=12, D=128	fp8 non-causal NHD	0.884 ms	1.017 ms	1.15x

Versus SageAttention v1 fp8

Latest fp8 native-vs-v1 sweep, B=1, H=32, S=1K/2K/4K/8K, D=16/64/128:

Mode	Speedup vs SageAttention v1 fp8
fp8 non-causal	2.48x-9.38x, geo 4.46x
fp8 causal	2.47x-12.07x, geo 5.38x

jammm · 2026-05-14T10:44:05Z

@0xDELUXA if possible can you try this PR branch on comfyui ?

0xDELUXA · 2026-05-14T10:51:21Z

@0xDELUXA if possible can you try this PR branch on comfyui ?

Sure! I’m AFK right now, but I’ll try it later. Looking great so far!

trfmk123 · 2026-05-14T11:24:55Z

I am using this implementation of SageAttention with the Z-Image Turbo model on a 9070XT GPU.

Issue:
There is a noticeable drop in precision. Specifically, the output images suffer from "color fading" — the saturation is much lower than expected, and the overall image looks pale.

jammm · 2026-05-14T11:33:49Z

I am using this implementation of SageAttention with the Z-Image Turbo model on a 9070XT GPU.

Issue: There is a noticeable drop in precision. Specifically, the output images suffer from "color fading" — the saturation is much lower than expected, and the overall image looks pale.

Thanks for checking! Does this happen with SageAttention v1 too, or is it specific to this PR?

jammm · 2026-05-14T11:54:55Z

@trfmk123 just pushed a fix. Can you try again?

trfmk123 · 2026-05-14T12:04:19Z

@trfmk123 just pushed a fix. Can you try again?

Just tested it, and the image outputs for Z-Image Turbo are completely normal now.
Thanks a lot for the fix and your hard work!

jammm · 2026-05-14T13:20:17Z

Just tested it, and the image outputs for Z-Image Turbo are completely normal now. Thanks a lot for the fix and your hard work!

Thanks! Perf has regressed a bit due to the fix, but getting the quality right is the main thing. I'll try to do a perf pass.

trfmk123 · 2026-05-14T13:48:18Z

Thanks! Perf has regressed a bit due to the fix, but getting the quality right is the main thing. I'll try to do a perf pass.

I suggest implementing v_scale scaling within the kernel. This would be a great way to fix the desaturation/precision issues without sacrificing any performance.

jammm · 2026-05-14T14:16:47Z

I suggest implementing v_scale scaling within the kernel. This would be a great way to fix the desaturation/precision issues without sacrificing any performance.

Good point. The gfx12 code currently takes raw fp8 weights without scaling which isn't desirable.
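
To make the v_scale idea concrete, here is a toy pure-Python emulation. It is a sketch under stated assumptions, not the PR's kernel: it assumes an e4m3-style fp8 format (3 mantissa bits, smallest normal exponent -6, maximum finite value 448), and `fake_fp8`, `quantize_raw`, and `quantize_scaled` are hypothetical helpers. Small-magnitude values fall into the coarse bottom of the fp8 grid when cast directly, whereas scaling them into the representable range first and undoing the scale afterwards keeps the error small.

```python
import math

FP8_MAX = 448.0  # max finite value of the assumed e4m3-style format


def fake_fp8(x):
    """Round x to an e4m3-like grid: 3 mantissa bits, min normal exponent -6."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_MAX)      # saturate instead of overflowing
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # below 2**-6 the grid stops getting finer
    step = 2.0 ** (exp - 3)       # spacing between neighbouring fp8 values
    return math.copysign(round(a / step) * step, x)


def quantize_raw(values):
    """Cast directly, with no scaling."""
    return [fake_fp8(v) for v in values]


def quantize_scaled(values):
    """Scale into the fp8 range, round, then undo the scale."""
    v_scale = (max(abs(v) for v in values) / FP8_MAX) or 1.0
    return [fake_fp8(v / v_scale) * v_scale for v in values]


def mean_abs_error(approx, exact):
    return sum(abs(a - b) for a, b in zip(approx, exact)) / len(exact)


# Made-up small-magnitude V channel.
v = [0.0010, -0.0007, 0.0015, 0.0004]
err_raw = mean_abs_error(quantize_raw(v), v)
err_scaled = mean_abs_error(quantize_scaled(v), v)
```

In a real kernel the multiplication by `v_scale` would be folded into the output accumulation rather than applied per element, which is presumably why the suggestion expects little performance cost.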

0xDELUXA · 2026-05-14T14:42:35Z

I'm using ROCm 7.13.0a20260504, and rocm-sdk init didn't populate _rocm_sdk_devel\lib\llvm\lib\clang\ (might be a local issue), so I had to set HIPCC_APPEND_FLAGS to point to _rocm_sdk_core\lib\llvm\lib\clang\23\include, otherwise the build fails with fatal error: '__clang_hip_runtime_wrapper.h' file not found.
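
A small helper can confirm where the missing header actually lives before pointing HIPCC_APPEND_FLAGS at it. This is a hedged sketch: the function name is made up, and the root directory to search (for example the venv's site-packages folder) is something the reader has to supply.

```python
from pathlib import Path

HEADER = "__clang_hip_runtime_wrapper.h"


def find_hip_wrapper_dirs(root):
    """Return every directory under root that contains the HIP wrapper header."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted({p.parent for p in root.rglob(HEADER)})
```

An empty result for the `_rocm_sdk_devel` tree together with a hit under `_rocm_sdk_core` would match the situation described above.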

jammm · 2026-05-14T14:51:02Z

I'm using ROCm 7.13.0a20260504, and rocm-sdk init didn't populate _rocm_sdk_devel\lib\llvm\lib\clang\ (might be a local issue), so I had to set HIPCC_APPEND_FLAGS to point to _rocm_sdk_core\lib\llvm\lib\clang\23\include, otherwise the build fails with fatal error: '__clang_hip_runtime_wrapper.h' file not found.

Did you install the "rocm-sdk-devel" pip wheel? It's part of "rocm[libraries,devel]" when you do a pip install.
If that doesn't help, try again from PowerShell/cmd in admin mode.

0xDELUXA · 2026-05-14T14:53:11Z

Did you install the "rocm-sdk-devel" pip wheel? It's part of "rocm[libraries,devel]" when you do a pip install. If that doesn't help, try again from PowerShell/cmd in admin mode.

I used this command to install: python -m pip install --pre --index-url https://rocm.nightlies.amd.com/v2-staging/gfx120X-all/ torch torchvision torchaudio rocm[devel,libraries]

jammm · 2026-05-14T14:53:57Z

I used this command to install: python -m pip install --pre --index-url https://rocm.nightlies.amd.com/v2-staging/gfx120X-all/ torch torchvision torchaudio rocm[devel,libraries]

Can you try pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio rocm[libraries,devel] in a fresh venv?

jammm · 2026-05-14T15:19:27Z

Can you try pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio rocm[libraries,devel] in a fresh venv?

Actually, this one goes into an endless loop, so don't try that. Do pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ rocm[libraries,devel] first, then pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio

0xDELUXA · 2026-05-14T15:19:45Z

Can you try pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio rocm[libraries,devel] in a fresh venv?

This installs ROCm 7.13.0a20260416, and after running rocm-sdk init, the _rocm_sdk_devel/lib/llvm/lib/clang/23/include folder is actually created.

jammm · 2026-05-14T15:20:53Z

This installs ROCm 7.13.0a20260416, and after running rocm-sdk init, the _rocm_sdk_devel/lib/llvm/lib/clang/23/include folder is actually created.

Yeah that should work. I use the same wheels.

0xDELUXA · 2026-05-14T15:22:10Z

No errors when building now. But these wheels are close to a month old. Don’t we have any newer ones that are “stable”?

jammm · 2026-05-14T15:29:26Z

They're in the index but are not being picked up for some reason. These are from May 11:
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ "rocm[libraries,devel]==7.13.0a20260511" "torch==2.10.0+rocm7.13.0a20260511" "torchaudio==2.10.0+rocm7.13.0a20260511" "torchvision==0.25.0+rocm7.13.0a20260511"

0xDELUXA · 2026-05-14T16:03:36Z

I don't really understand what you meant by:

With `--use-sage-attention`, ComfyUI reported `Using xformers attention in VAE`.

in #368 (comment).

AFAIK, if xformers isn't installed as a standalone package, ComfyUI won't print that message. For me, regardless of the attention backend used (SDPA, Sage, Flash), it always prints Using split attention in VAE unless I install xformers separately. This is partly why I created this node.

Could this be happening on your end because you're also working on ROCm/xformers#87?

0xDELUXA · 2026-05-15T14:39:04Z

Out of curiosity, is there a way to verify that the SA v2 path is actually used?

According to the PR description, it prioritizes this implementation over Triton on RDNA4:

automatic gfx12 runtime dispatch from the public sageattn API

trfmk123 · 2026-05-15T14:57:47Z

@trfmk123 Which FlashAttention version are you using? I assume it isn't the latest one from aiter. Are you using v2.8.3? Also, the script shouldn't print "FlashAttention-2 CK" there at all.

my setup is running Flash Attention 2.8.4 via the composable_kernel (ck) backend.

0xDELUXA · 2026-05-15T15:02:06Z

Quick script I used to bench this native Sage 2.2.0 implementation against Flash 2.8.4 (aiter triton) on Windows.

Results:

(venv) PS C:\> python bench_sage2_gfx12_vs_fa2.py

Device : AMD Radeon RX 9060 XT  (gfx1200)
Config : dtype=fp16  causal=False  B=1  Hq=32  Hkv=32  warmup=50  iters=200

Loading implementations …
  [sage] using sageattn_qk_int8_pv_gfx12_native (direct)
C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_interface.py:17: UserWarning: flash_attn_2_cuda (which has ROCm/HIP kernels) not found, falling back to Triton implementation
  warnings.warn("flash_attn_2_cuda (which has ROCm/HIP kernels) not found, falling back to Triton implementation")
[aiter] Windows: CK and HIP ops are not available. Triton ops only.
  [fa2]  using flash_attn.flash_attn_func

     S     D    Sage ms     FA2 ms   Sage TF/s    FA2 TF/s   Speedup
────────────────────────────────────────────────────────────────────
   512    64      0.240      0.320        8.96        6.71    1.335x
  1024    64      0.391      0.567       21.95       15.15    1.449x
  2048    64      0.710      1.284       48.37       26.75    1.808x
  4096    64      2.110      4.218       65.14       32.58    1.999x
  8192    64      7.314     16.832       75.16       32.66    2.301x
   512   128      0.290      0.398       14.81       10.80    1.371x
  1024   128      0.603      0.797       28.47       21.55    1.322x
  2048   128      1.444      2.117       47.59       32.46    1.466x
  4096   128      4.216      8.041       65.20       34.19    1.907x
  8192   128     14.962     31.508       73.49       34.90    2.106x
  [sage] S=1024 D=256 FAILED: gfx12 fp8 value path currently supports head_dim 16, 64, or 128.
  1024   256      n/a      1.654      n/a       20.77      n/a
  [sage] S=2048 D=256 FAILED: gfx12 fp8 value path currently supports head_dim 16, 64, or 128.
  2048   256      n/a      4.876      n/a       28.19      n/a
  [sage] S=4096 D=256 FAILED: gfx12 fp8 value path currently supports head_dim 16, 64, or 128.
  4096   256      n/a     16.888      n/a       32.55      n/a
────────────────────────────────────────────────────────────────────

SageAttn gfx12 wins 10/10 configs  |  avg speedup vs FA2: 1.706x
→ SageAttention gfx12 native is faster on average.

── Numerical sanity check (S=1024, D=64) ──
  max |sage - fa2| : 0.02026
  mean|sage - fa2| : 0.001528
  ✓ outputs are numerically close (expected for INT8 quant)
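
The TF/s and average-speedup columns in the log can be recomputed from the millisecond timings. The sketch below assumes the usual non-causal attention FLOP count of 4·B·H·S²·D (two matrix multiplications at 2 FLOPs per multiply-add); the benchmark script itself is not shown in the thread, so this formula is an assumption, although it reproduces the printed values to within rounding. It also computes a geometric mean, the statistic used in the PR description's tables, next to the arithmetic mean the script prints.

```python
import math

B, H = 1, 32  # from the "Config" line of the log

# (S, D, sage_ms, fa2_ms), copied from the ten rows that completed.
rows = [
    (512, 64, 0.240, 0.320),
    (1024, 64, 0.391, 0.567),
    (2048, 64, 0.710, 1.284),
    (4096, 64, 2.110, 4.218),
    (8192, 64, 7.314, 16.832),
    (512, 128, 0.290, 0.398),
    (1024, 128, 0.603, 0.797),
    (2048, 128, 1.444, 2.117),
    (4096, 128, 4.216, 8.041),
    (8192, 128, 14.962, 31.508),
]


def tflops(seq, dim, ms, batch=B, heads=H):
    """Throughput in TFLOP/s for non-causal attention (assumed FLOP model)."""
    flops = 4 * batch * heads * seq * seq * dim
    return flops / (ms * 1e-3) / 1e12


speedups = [fa2_ms / sage_ms for _, _, sage_ms, fa2_ms in rows]
mean_speedup = sum(speedups) / len(speedups)
geo_speedup = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is never larger than the arithmetic mean, so the two summaries should not be compared directly across tables.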

Quick chart:

Great results overall! 🚀

@jammm Is there a limitation preventing support for head_dim 256?

0xDELUXA · 2026-05-15T15:02:52Z

my setup is running Flash Attention 2.8.4 via the composable_kernel (ck) backend.

Oh, I thought you're on Windows - never mind then. My point was that if someone doesn't have FA-2 CK, it automatically uses Triton, but the script still prints CK, which is a bit misleading.

jammm · 2026-05-15T15:18:53Z

@jammm Is there a limitation preventing support for head_dim 256?

There is. I'm working on bringing parity vs the CUDA path.
Having said that, the CUDA path also rejects head_dim > 128, so this is not specific to gfx12. It's the Blackwell-specific code (SageAttention v3) that supports 256. The scope of this PR is SageAttention v2, though, since SageAttention v3 needs fp4 support, which RDNA4 doesn't have.

0xDELUXA · 2026-05-15T15:28:48Z

@jammm Is there a limitation preventing support for head_dim 256?

There is. I'm working on bringing parity vs the CUDA path.
Having said that, the CUDA path also rejects head_dim > 128, so this is not specific to gfx12. It's the Blackwell-specific code (SageAttention v3) that supports 256. The scope of this PR is SageAttention v2, though, since SageAttention v3 needs fp4 support, which RDNA4 doesn't have.

I see.

Again, huge respect for the work you put into this.

At first, we only had SDPA Math on RDNA4 (Windows).
After that, you helped make AOTriton a thing.
Later, your contributions to triton-windows made Sage1 and FA2 possible to use.
And now here we are with Sage2.

Oh, and it all started with your PyTorch wheels on Windows. 👀

trfmk123 · 2026-05-15T16:15:31Z

@0xDELUXA

Oh, I thought you're on Windows - never mind then. My point was that if someone doesn't have FA-2 CK, it automatically uses Triton, but the script still prints CK, which is a bit misleading.

My OS is Windows 11. Below are my test results.

0xDELUXA · 2026-05-15T16:24:58Z

My OS is Windows 11. Below are my test results.

So you’ve built FA2 CK on Windows? Not many people have done this...

Based on your results, FA2 CK outperforms Sage2 at smaller batch sizes (which wasn't the case for FA2 Triton), while Sage2 takes the lead at larger ones.

trfmk123 · 2026-05-15T16:34:36Z

@0xDELUXA

So you’ve built FA2 CK on Windows? Not many people have done this...

Based on your results, FA2 CK outperforms Sage2 at smaller batch sizes (which wasn't the case for FA2 Triton), while Sage2 takes the lead at larger ones.

I referred to this Japanese article to compile Flash Attention 2.8.4 (CK backend) on Windows:
https://note.com/lpp/n/nf24da8645c3c

0xDELUXA · 2026-05-15T16:51:51Z

I referred to this Japanese article to compile Flash Attention 2.8.4 (CK backend) on Windows: https://note.com/lpp/n/nf24da8645c3c

It's kind of all over the place with questions like "Can fa4-v4.0.0.beta be built?" The answer is no - it can't. We can build FA3, though. It also has FP8 support on RDNA4, for example. Also, that setup.py patch isn't really needed after Dao-AILab/flash-attention#2517.

crashingalexsan · 2026-05-15T20:24:50Z

Finally able to compile Windows multi arch

@jammm I had to remove the include dirs

I was hitting duplicate headers (sent you a Discord message about that). Not sure if related to torch, multiarch or what (fresh venv)

Compared against FA2 CK backend. Similar results to @trfmk123

Using the Torchmark script

woct0rdho · 2026-05-16T15:36:22Z

Is there a limitation preventing support for head_dim 256?

@0xDELUXA FYI, head dim 256 is impossible on Nvidia sm90 (without some splitting) due to the WGMMA size constraint, but it's possible on Nvidia consumer architectures. I've tried it in #329. I haven't tried it on AMD GPUs.

qweqweewqe7-create · 2026-05-17T01:47:22Z

Windows ROCm Build - Image Corruption Bug

Hi @jammm, I successfully built SageAttention on Windows with ROCm 7.13 + gfx1201 (RX 9070 XT), but the generated images come out as pure noise.

Environment

OS: Windows 11
GPU: AMD RX 9070 XT (gfx1201)
ROCm: 7.13.0a20260511
PyTorch: 2.10.0+rocm7.13
Python: 3.12

Build Issues & Workarounds

The code couldn't be compiled as-is on Windows due to math function resolution issues. The following functions were not recognized by the HIP compiler on Windows:

fmaxf, fabsf, fminf, nearbyintf — no matching function for call
max in reduction_utils_hip.cuh — use of undeclared identifier

Workaround used to get it to compile:

Replaced fmaxf( → __builtin_fmaxf( etc. in .cu source files
Replaced val = max(...) with ternary operator in reduction_utils_hip.cuh

Problem

With the above workaround, the build succeeds and SageAttention loads, but all generated images are corrupted (pure noise). Without --use-sage-attention, images generate correctly.

I also tried replacing with ::fmaxf(, __ocml_fmax_f32( etc. but the same corruption occurs.

This suggests the math function substitution is producing incorrect results in the GPU kernel, likely affecting the quantization scale calculation in qk_int_sv_gfx12_native.

Question

Is there a correct way to resolve fmaxf/fabsf on Windows HIP compiler? Or is there a known fix for this?

Speed improvement without the bug would be significant — the workflow went from 34s → 23s in testing with ROCm 7.2 triton fallback.

Thanks for the great work on this PR!

Update:

I tried again from a clean checkout and found that the pure-noise output was likely caused by my previous workaround using __builtin_fmaxf, __builtin_fabsf, etc.

A less aggressive workaround now builds and produces correct images on my setup.

Additional workaround that worked for me:

Removed explicit include_dirs=include_dirs from the ROCm CUDAExtension entries in setup.py
Added simple device helper functions for fabsf, fmaxf, fminf instead of replacing them with __builtin_*
Reused the existing v_cvt_i32_f32 inline asm rounding path instead of nearbyintf
Added local max / min helpers in reduction_utils.cuh
Replaced a few host-side std::max(q_heads, kv_heads) / ::max(...) grid dimension expressions with explicit ternaries, because hipify/Windows HIP produced ::max errors

After this, SageAttention imports and ComfyUI generates correct images.

Current result:

Baseline ComfyUI: ~34s
--use-sage-attention: ~28s
About 17.6% lower wall time / ~1.21x speedup

Environment:

Windows 11
RX 9070 XT / gfx1201
Python 3.12
PyTorch 2.10.0+rocm7.13.0a20260511
ROCm HIP 7.13.26176

One more Windows-specific issue: triton is not available for this environment via the AMD gfx120X wheel index, so importing sageattention failed at first because core.py imports the Triton backend unconditionally. I patched core.py locally to allow import without Triton and use only the gfx12 native backend.
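
The local patch itself is not shown in the thread. One common way to make an optional dependency non-fatal at import time looks roughly like the sketch below; `pick_backend` and the backend labels are hypothetical and do not reflect the actual structure of core.py.

```python
import importlib.util

# Check availability without importing, so a missing Triton is not fatal.
HAS_TRITON = importlib.util.find_spec("triton") is not None


def pick_backend(is_gfx12, has_triton=HAS_TRITON):
    """Hypothetical dispatch: prefer the native path, fall back to Triton."""
    if is_gfx12:
        return "gfx12_native"
    if has_triton:
        return "triton"
    raise RuntimeError(
        "no attention backend available: install Triton or use a supported GPU"
    )
```

With a guard like this, the Triton-backed functions would only be imported inside the branch that needs them, so the package still loads on systems that use the native path exclusively.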

So the current status is: the gfx12 native backend can work on Windows/gfx1201, but Windows HIP needs a few compile/import fixes. The previous pure-noise result should probably be treated as caused by my bad math-function workaround, not necessarily a kernel bug.

0xDELUXA · 2026-05-17T08:06:04Z

@qweqweewqe7-create This looks like an environment-specific issue on your side rather than a universal problem. Also, triton-windows needs to be installed separately in the venv - not specifically for this PR, but in general.

DrywFiltiarn · 2026-05-17T08:23:15Z

@qweqweewqe7-create Over the past few days I have compiled it several times against the 20260416 nightlies, which ship a newer torch that is not currently available with the later nightly builds, and I had no issues compiling or running it. Could the specific nightly you're building against be (partially) broken?

DrywFiltiarn · 2026-05-19T17:03:20Z

Been busy I see @jammm. Curious about all of the changes and whether it adds further improvements to the performance.

BrummBearV2 · 2026-05-19T22:42:18Z

Hello there. I've been trying to get this to work for three days now, and since I can't, I'd like to ask whether someone would be willing to write a "for dummies" guide. As a side note: I've never built anything from source and have no real idea why the build process fails.

This is my .bat:

cd G:\ComfyUI
git clone https://github.com/jammm/SageAttention.git SageAttention-jammm
cd SageAttention-jammm
git checkout jam/gfx12

Activate Visual Studio environment

cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }

::set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0
::set CUDA_PATH=%CUDA_HOME%
set PATH=%CUDA_HOME%\bin;%PATH%
set ROCM_HOME=C:\Program Files\AMD\ROCm\7.1
set HIP_PATH=C:\Program Files\AMD\ROCm\7.1
set INCLUDE=G:\ComfyUI\python_embeded\include;G:\ComfyUI\python_embeded\Include;%INCLUDE%
set CUDA_HOME=C:\Program Files\AMD\ROCm\7.1
set "ROCM_HOME=C:\Program Files\AMD\ROCm\7.1"
set INCLUDE=G:\ComfyUI\python_embeded\include;G:\ComfyUI\python_embeded\Include;%INCLUDE%

$env:PYTORCH_ROCM_ARCH = "gfx1201"

:: Move into your SageAttention source directory
cd G:\ComfyUI\SageAttention-jammm

python -m pip install --upgrade setuptools wheel
python setup.py sdist bdist_wheel

:: Force a clean install using ComfyUI's python
..\python_embeded\python.exe -m pip install . --force-reinstall

pause

and this is the error:
Note: including file: G:\ComfyUI\SageAttention-jammm\csrc\fused\fused.h
1 error generated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "G:\ComfyUI\python_embeded\Lib\site-packages\torch\utils\cpp_extension.py", line 2597, in _run_ninja_build
subprocess.run(
File "subprocess.py", line 571, in run
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '32']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "G:\ComfyUI\python_embeded\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 389, in <module>
      main()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 373, in main
      json_out["return_val"] = hook(**hook_input["kwargs"])
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 280, in build_wheel
      return _build_backend().build_wheel(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\build_meta.py", line 441, in build_wheel
      return _build(['bdist_wheel', '--dist-info-dir', str(metadata_directory)])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\build_meta.py", line 429, in _build
      return self._build_with_temp_dir(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\build_meta.py", line 410, in _build_with_temp_dir
      self.run_setup()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\build_meta.py", line 317, in run_setup
      exec(code, locals())
    File "<string>", line 467, in <module>
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\__init__.py", line 117, in setup
      return distutils.core.setup(**attrs)  # type: ignore[return-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\core.py", line 186, in setup
      return run_commands(dist)
             ^^^^^^^^^^^^^^^^^^
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\core.py", line 202, in run_commands
      dist.run_commands()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\dist.py", line 1000, in run_commands
      self.run_command(cmd)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\dist.py", line 1107, in run_command
      super().run_command(command)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\dist.py", line 1019, in run_command
      cmd_obj.run()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\command\bdist_wheel.py", line 370, in run
      self.run_command("build")
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\cmd.py", line 341, in run_command
      self.distribution.run_command(command)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\dist.py", line 1107, in run_command
      super().run_command(command)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\dist.py", line 1019, in run_command
      cmd_obj.run()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\command\build.py", line 135, in run
      self.run_command(cmd_name)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\cmd.py", line 341, in run_command
      self.distribution.run_command(command)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\dist.py", line 1107, in run_command
      super().run_command(command)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\dist.py", line 1019, in run_command
      cmd_obj.run()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\command\build_ext.py", line 97, in run
      _build_ext.run(self)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 367, in run
      self.build_extensions()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\torch\utils\cpp_extension.py", line 1082, in build_extensions
      build_ext.build_extensions(self)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 481, in build_extensions
      self._build_extensions_parallel()
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 504, in _build_extensions_parallel
      fut.result()
    File "concurrent\futures\_base.py", line 456, in result
    File "concurrent\futures\_base.py", line 401, in __get_result
    File "concurrent\futures\thread.py", line 59, in run
    File "<string>", line 224, in build_extension
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\command\build_ext.py", line 262, in build_extension
      _build_ext.build_extension(self, ext)
    File "G:\ComfyUI\python_embeded\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 564, in build_extension
      objects = self.compiler.compile(
                ^^^^^^^^^^^^^^^^^^^^^^
    File "<string>", line 215, in compile_new
    File "G:\ComfyUI\python_embeded\Lib\site-packages\torch\utils\cpp_extension.py", line 1051, in win_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "G:\ComfyUI\python_embeded\Lib\site-packages\torch\utils\cpp_extension.py", line 2223, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "G:\ComfyUI\python_embeded\Lib\site-packages\torch\utils\cpp_extension.py", line 2614, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for sageattention
Failed to build sageattention
error: failed-wheel-build-for-install

× Failed to build installable wheels for some pyproject.toml based projects
╰─> sageattention

G:\ComfyUI\SageAttention-jammm>

Add native gfx12 attention backend

87781e3

jammm force-pushed the jam/gfx12 branch from d8bc54a to 87781e3 on May 13, 2026 19:19

jammm added 4 commits May 14, 2026 10:33

Prune unused gfx12 native code

9837bb5

Prune gfx12 native experimental dispatch

f984012

Prune dead gfx12 native kernel variants

801352d

Simplify ROCm build setup

34802a2

jammm force-pushed the jam/gfx12 branch from 9a411ad to 34802a2 on May 14, 2026 08:58

Support gfx12 native sequence tails

6b71b93

Use standard gfx12 kernel launch syntax

902b83d

Restore gfx12 smooth attention defaults

b5fb2d5

Optimize gfx12 smooth fp8 NHD path

0f3024f

Close gfx12 SageAttention2 parity gaps

1de8769

jammm added 3 commits May 16, 2026 02:41

Match CUDA smooth_v handling on gfx12

4ac1a9b

Complete gfx12 SageAttention2 API parity

5d6c3ca

Tighten gfx12 SageAttention2 parity

7dca990

jammm added 9 commits May 17, 2026 17:46

Optimize and split gfx12 SageAttention2 kernels

7ccc528

Improve gfx12 fp16 short paths

fd0b8ef

Improve gfx12 fp16 short static routes

3b23a20

Optimize gfx12 fp8 wrapper and fp16 D128 path

c553748

Use static NHD fp16 path for D64 long gfx12

f957387

Tune gfx12 D128 fp16 quant packing

c91aef8

Improve gfx12 fp16 stream routes

618fd87

gfx12: optimize short fp16 native paths

fc8bd21

gfx12: specialize D64 long causal NHD mean

30c94f5
